spec(012) Phase 1: structured action items + most-recent-verdict acceptance gate #198
Merged
jeremymanning added a commit that referenced this pull request May 18, 2026
Updates the "How it works → The paper pipeline" section to describe the spec-012 convergence pipeline (structured action items, most-recent verdict gate, three-way severity routing, per-specialist re-review protocol, and arxiv-intake guardrail). Closes the last remaining task in the spec-012 task list (T053). With this commit, all 55 of 55 tasks are now landed on PR #198.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f292689 to 280f8b6
…ptance gate
Implements the convergence-pipeline foundation for spec 012:
SCHEMA (T001-T009):
- New Stage enum values: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION,
PAPER_REVISION_BLOCKED. Added to project-state.schema.yaml + lifecycle
ALLOWED_TRANSITIONS (additive; old transitions retained for back-compat).
- New ActionItem pydantic model (id, text, severity ∈ {writing,science,fatal}).
Stable IDs derived from canonicalize(text) → sha1[:12]; canonicalization
absorbs section/figure/table/equation refs + casing.
- ReviewRecord gains action_items field (default []). Validator: non-accept
verdicts under prompt_version >= 1.1.0 MUST include ≥1 action_item.
Legacy 1.0.x records are grandfathered.
- Project gains revision_spec_path field for the READY_FOR_IMPLEMENTATION flag.
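The ID derivation and the validator rule above can be sketched as follows. This is a minimal illustration, not the PR's code: it uses a stdlib dataclass in place of the pydantic model so it runs without dependencies, and the reference-absorbing regex in `canonicalize` is an assumption about what "absorbs section/figure/table/equation refs" means. `validate_review` is a hypothetical helper name.

```python
import hashlib
import re
from dataclasses import dataclass

SEVERITIES = ("writing", "science", "fatal")

# Assumption: this pattern approximates "absorbs section/figure/table/equation
# refs"; the project's actual canonicalization rules are not shown in the PR.
_NUMBERED_REF = re.compile(
    r"\b(?:section|sec\.?|figure|fig\.?|table|equation|eq\.?)\s*\(?\d+(?:\.\d+)*\)?"
)


def canonicalize(text: str) -> str:
    """Lowercase, replace numbered references with a placeholder, collapse whitespace."""
    replaced = _NUMBERED_REF.sub("<ref>", text.lower())
    return " ".join(replaced.split())


def action_item_id(text: str) -> str:
    """Stable 12-hex-character ID: sha1 of the canonicalized text."""
    return hashlib.sha1(canonicalize(text).encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class ActionItem:
    """Stdlib stand-in for the pydantic ActionItem model."""

    text: str
    severity: str
    id: str = ""

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.id:
            # Frozen dataclass: set the derived ID via object.__setattr__.
            object.__setattr__(self, "id", action_item_id(self.text))


def validate_review(verdict: str, prompt_version: str, action_items: list) -> None:
    """Non-accept verdicts under prompt_version >= 1.1.0 need at least one item."""
    version = tuple(int(part) for part in prompt_version.split("."))
    if verdict != "accept" and version >= (1, 1, 0) and not action_items:
        raise ValueError("non-accept verdict requires at least one action item")
```

Under this sketch, two phrasings that differ only in casing, spacing, or the number of a referenced figure hash to the same ID, which is what lets a re-review recognize an unchanged action item.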
PROMPTS (T010-T011):
- agents/prompts/paper_reviewer.md (lead) + 12 specialist prompts updated to
emit action_items block in YAML frontmatter.
- agents/prompts/_shared/rereview_block.md: shared re-review protocol snippet
(single source of truth). Used when prior reviews exist FOR THIS specialist.
- agents/registry.yaml: prompt_version bumped 1.0.0 → 1.1.0 for all 13
paper_reviewer entries.
REVIEWER (T012):
- paper_reviewer.py handle_response: normalizes action_items emitted by the
LLM (derives missing IDs via action_item_id()).
ACCEPTANCE GATE (T014-T017, US1):
- advancement.py: replaced "any-historical-accept" gate with most-recent
non-stale verdict per specialist (FR-001/002/003). Stale-hash reviews are
ignored. The redundant point threshold (PAPER_ACCEPT_THRESHOLD) is dropped
for the all-accept condition — when every specialist's most-recent is
accept, the project transitions to PAPER_ACCEPTED.
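A minimal sketch of the "most-recent non-stale verdict per specialist" rule, assuming a simplified record shape. The `Review` dataclass and its `artifact_hash` field are hypothetical stand-ins for ReviewRecord, and the non-accept verdict string is illustrative.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Review:
    """Hypothetical, simplified stand-in for a ReviewRecord."""

    specialist: str
    verdict: str
    reviewed_at: datetime
    artifact_hash: str


def most_recent_non_stale(reviews, current_hash):
    """Map each specialist to its latest review of the current artifact hash."""
    latest = {}
    for review in reviews:
        if review.artifact_hash != current_hash:
            continue  # stale-hash reviews are ignored
        previous = latest.get(review.specialist)
        if previous is None or review.reviewed_at > previous.reviewed_at:
            latest[review.specialist] = review
    return latest


def all_accept(reviews, required, current_hash):
    """True when every required specialist's most recent verdict is accept.

    Assumes `required` is non-empty; an empty list needs separate handling.
    """
    latest = most_recent_non_stale(reviews, current_hash)
    return all(
        name in latest and latest[name].verdict == "accept" for name in required
    )
```

The difference from an "any-historical-accept" gate shows up when a specialist accepted once and later changed its verdict: only the later record counts here.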
SEVERITY ROUTING (T018-T021, US4):
- advancement.py: max-severity across specialists drives routing.
- fatal → BRAINSTORMED with rejection rationale appended to the idea
record (via upstream_feedback.append_rejection_rationale).
- writing / science → legacy MINOR/MAJOR revision stages for now (the
auto-plan revision_planner is part of US2/US3, deferred to Phase 2).
- Back-compat: when records lack action_items (prompt_version 1.0.x),
fall back to the pre-spec-012 _winning_recommendation. PROJ-578 / etc.
continue to route correctly until they're re-reviewed under 1.1.0.
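The routing rule above might look like the sketch below. Several details are assumptions: the ordering writing < science < fatal, the mapping of writing to the minor and science to the major revision stage, and the stage name `PAPER_MAJOR_REVISION` (inferred by analogy with `PAPER_MINOR_REVISION`). `legacy_stage` stands in for whatever the pre-spec-012 `_winning_recommendation` path would return.

```python
# Assumed ordering: writing < science < fatal.
SEVERITY_RANK = {"writing": 0, "science": 1, "fatal": 2}

STAGE_BY_SEVERITY = {
    "fatal": "BRAINSTORMED",
    "science": "PAPER_MAJOR_REVISION",  # assumed stage name
    "writing": "PAPER_MINOR_REVISION",
}


def max_severity(items_by_specialist):
    """Return the worst severity across all specialists, or None if no items."""
    severities = [
        item["severity"]
        for items in items_by_specialist.values()
        for item in items
    ]
    if not severities:
        return None
    return max(severities, key=SEVERITY_RANK.__getitem__)


def route(items_by_specialist, legacy_stage):
    """Pick the next stage; fall back to the legacy result without action items."""
    worst = max_severity(items_by_specialist)
    if worst is None:
        # Back-compat: records written under prompt_version 1.0.x carry no items.
        return legacy_stage
    return STAGE_BY_SEVERITY[worst]
```

A single fatal item from any specialist outranks any number of writing or science items from the others.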
ARXIV-INTAKE GUARDRAIL (T040-T045, US7):
- New module src/llmxive/agents/upstream_feedback.py.
- is_arxiv_intake(project_dir): detects third-party arxiv submissions
(metadata.json present AND paper/specs/ absent).
- record_round(...): atomically appends a Round to
projects/<PROJ-ID>/upstream_feedback.yaml.
- append_rejection_rationale(...): annotates the idea record on BRAINSTORMED
transition (best-effort; defensive).
- advancement.py routes arxiv-intake projects to PAPER_ACCEPTED (with caveats
in upstream_feedback.yaml) or BRAINSTORMED — NEVER attempts to mutate
paper/source/.
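A rough sketch of the detection rule and of one common way to make a file update atomic. It reads "paper/specs/ absent" as the single path `<project_dir>/paper/specs`, which is an interpretation, and `atomic_write_text` is a hypothetical helper: the PR does not show how `record_round` achieves atomicity.

```python
import os
import tempfile
from pathlib import Path


def is_arxiv_intake(project_dir: Path) -> bool:
    """metadata.json present AND paper/specs/ absent (path layout assumed)."""
    has_metadata = (project_dir / "metadata.json").is_file()
    has_specs = (project_dir / "paper" / "specs").exists()
    return has_metadata and not has_specs


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace swaps the file in one step when source and target
        # live on the same filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Appending a round would then be read, modify, and `atomic_write_text` of the full document, so a crash mid-write leaves the previous `upstream_feedback.yaml` intact.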
SPEC ARTIFACTS:
- specs/012-paper-review-convergence/: spec.md, plan.md, research.md,
data-model.md, quickstart.md, 4 contracts, checklists/requirements.md,
tasks.md (55 tasks). /speckit-analyze produced 8 findings (1H/3M/2L);
all 8 fixed in iteration 1.
- CLAUDE.md updated to point at the new plan.
- contracts/project-state.schema.yaml: 3 new stage values + revision_spec_path.
TESTS:
- 39 new unit tests across test_action_item_schema.py,
test_review_record_action_items.py, test_advancement_convergence.py.
- Full unit suite (451 tests) passes.
DEFERRED to follow-up PRs:
- T022-T034: revision_planner (auto-plan 5-stage subprocess driver).
- T035-T039: paper_reviewer.py wiring the shared rereview snippet into
the prompt when prior reviews exist (the snippet is created in this PR;
the consumer logic is the follow-up).
- T046-T050: scheduler idempotency + unblock CLI + e2e convergence test.
- T052-T053: web dashboard rendering of new stage badges, README update.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a specialist reviewer has ≥1 prior review record for THIS project, paper_reviewer.py now prepends the shared re-review block (from agents/prompts/_shared/rereview_block.md) to the user prompt, with the specialist's most-recent prior action_items substituted in. The block instructs the LLM to apply the two-question protocol (FR-014/015/016) instead of generating a fresh critique. A specialist with NO prior records continues to use the full-critique prompt (FR-017). This is the per-specialist toggle from clarification session Q2.
Changes:
- src/llmxive/state/reviews.py: prior_reviews_for_specialist() filters list_for() output to one specialist + sorts by reviewed_at ascending.
- src/llmxive/agents/paper_reviewer.py build_messages: when prior reviews exist FOR THIS specialist, render the shared snippet with the most-recent prior's action_items as YAML, prepend it to the user prompt.
- contracts/review-record.schema.yaml: action_items array added so old-record-validation doesn't reject the new field on serialization.
- tests/unit/test_rereview_per_specialist_toggle.py: 7 new tests covering per-specialist filtering, sort order, snippet presence, no-priors path.
Full unit suite (458 tests, +7) still passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
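The per-specialist toggle described in this commit can be sketched as below. Records are plain dicts with ISO-8601 `reviewed_at` strings here for brevity; the `{{prior_action_items}}` placeholder and the `build_user_prompt` helper are hypothetical, since the commit message does not show the snippet's template syntax.

```python
import json


def prior_reviews_for_specialist(records, specialist):
    """Keep one specialist's records, sorted by reviewed_at ascending."""
    mine = [record for record in records if record["specialist"] == specialist]
    # ISO-8601 timestamps in a uniform format sort correctly as strings.
    return sorted(mine, key=lambda record: record["reviewed_at"])


def render_items_as_yaml(items):
    """Tiny hand-rolled YAML list; JSON-quoted strings are valid YAML scalars."""
    lines = []
    for item in items:
        lines.append(f"- id: {item['id']}")
        lines.append(f"  severity: {item['severity']}")
        lines.append(f"  text: {json.dumps(item['text'])}")
    return "\n".join(lines)


def build_user_prompt(base_prompt, records, specialist, rereview_template):
    """Prepend the re-review block only when this specialist has prior reviews."""
    priors = prior_reviews_for_specialist(records, specialist)
    if not priors:
        return base_prompt  # no priors: full-critique prompt, unchanged
    latest_items = priors[-1].get("action_items", [])
    block = rereview_template.replace(
        "{{prior_action_items}}", render_items_as_yaml(latest_items)
    )
    return block + "\n\n" + base_prompt
```

Because the filter is per specialist, a reviewer added to the panel mid-cycle still produces a full critique on its first pass while the others re-review.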
Adds the operator escape hatch and scheduler skip rules required by spec 012:
- src/llmxive/cli.py: new subcommand `llmxive project unblock <PROJ-ID>` (FR-023). Refuses to no-op-unblock: requires the most-recent state/revisions/<PROJ-ID>/round-N.yaml file to be modified AFTER the project's recorded updated_at (mtime check). Transitions to PAPER_REVIEW by default; --to-minor transitions to PAPER_MINOR_REVISION.
- src/llmxive/pipeline/scheduler.py: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION, and PAPER_REVISION_BLOCKED added to _NEVER_PICK. FR-009's idempotency rule: while a project is being planned, the regular scheduler MUST NOT re-trigger work on it. The ready/blocked states are owned by dedicated agents (implementer + human respectively), not the regular tick-scheduler.
- tests/unit/test_cli_project_unblock.py: 5 tests covering happy path, --to-minor flag, no-op-unblock refusal, wrong-stage refusal, missing round-file refusal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
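The no-op-unblock refusal amounts to an mtime comparison, sketched below under stated assumptions: `latest_round_file` and `can_unblock` are hypothetical helper names, "most-recent" is taken to mean the highest round number N, and `updated_at` is assumed to be a timezone-aware datetime.

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def latest_round_file(revisions_dir: Path):
    """Return the round-N.yaml file with the highest N, or None if there is none."""
    rounds = []
    for path in revisions_dir.glob("round-*.yaml"):
        match = re.fullmatch(r"round-(\d+)\.yaml", path.name)
        if match:
            rounds.append((int(match.group(1)), path))
    if not rounds:
        return None
    # Compare numerically so round-10 beats round-2.
    return max(rounds, key=lambda pair: pair[0])[1]


def can_unblock(revisions_dir: Path, updated_at: datetime) -> bool:
    """Allow unblocking only if the latest round file changed after updated_at."""
    latest = latest_round_file(revisions_dir)
    if latest is None:
        return False  # missing round file: refuse
    modified = datetime.fromtimestamp(latest.stat().st_mtime, tz=timezone.utc)
    return modified > updated_at
```

The strict `>` means a round file untouched since the project was blocked never qualifies, which is the behavior the no-op-unblock refusal test describes.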
3 integration tests in tests/integration/test_revision_in_progress_idempotency.py:
- verify the three spec-012 stages are in _NEVER_PICK
- verify a runnable project is preferred over an in-progress one
- verify the scheduler returns None when every project is in a NEVER_PICK state
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a home-grown paper enters PAPER_REVIEW with writing/science action
items (no fatal), advancement.py now transitions the project to
PAPER_REVISION_IN_PROGRESS and invokes revision_planner.run_revision_pipeline.
The planner produces a full revision-spec directory under
specs/auto-revisions/<PROJ-ID>/round-<N>/ containing spec.md, plan.md,
tasks.md, analyze-report.md, and result.yaml.
Implementation is DETERMINISTIC (v1): each of the 5 stage outputs is
generated directly from the consolidated action items (no LLM call).
The spec/plan/tasks artifacts are concrete enough that an implementer
agent can pick up the revision_spec_path and execute. A follow-up PR
replaces the deterministic generation with the full LLM-driven speckit
pipeline (speckit-{specify,clarify,plan,tasks,analyze}).
Public API contract is stable across v1 (deterministic) and v2 (LLM-driven):
run_revision_pipeline(project_id, action_items, *, revision_kind, repo_root)
-> RevisionSpecResult{revision_spec_path, final_outcome, stage_results, ...}
Defensive checks:
- ArxivIntakeError on arxiv-intake projects (advancement.py routes
them through upstream_feedback instead).
- RevisionPlanningError on FS/schema failures.
- On either error, advancement.py transitions to PAPER_REVISION_BLOCKED
so the operator notices.
state/revisions/index.yaml is also updated atomically so an implementer
agent can discover ready-for-implementation projects without scanning
the filesystem.
8 new unit tests in tests/unit/test_revision_planner.py cover the
5-artifact generation, action-item-to-task mapping, arxiv-intake
guardrail, science vs writing kinds, round-number incrementing, and
the index.yaml update.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
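Two of the deterministic pieces described in this commit, round-number incrementing and the action-item-to-task mapping, could look roughly like this. The helper names, the `T001`-style task numbering, the checkbox line format, and the assumption that rounds start at 1 are all illustrative; the PR does not show the generated `tasks.md` layout.

```python
import re
from pathlib import Path


def next_round_dir(base: Path) -> Path:
    """Return base/round-<N+1>, where N is the highest existing round (0 if none)."""
    existing = []
    if base.is_dir():
        for path in base.iterdir():
            match = re.fullmatch(r"round-(\d+)", path.name)
            if match:
                existing.append(int(match.group(1)))
    return base / f"round-{max(existing, default=0) + 1}"


def render_tasks(action_items) -> str:
    """Emit one checkbox task per action item, with no LLM call involved."""
    lines = ["# Tasks", ""]
    for number, item in enumerate(action_items, start=1):
        lines.append(
            f"- [ ] T{number:03d} [{item['severity']}] {item['text']} "
            f"(action item {item['id']})"
        )
    return "\n".join(lines) + "\n"
```

Carrying the action-item ID into each task line lets a later re-review match a task back to the finding that produced it.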
… (T052)
- web_data.py: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION,
PAPER_REVISION_BLOCKED added to _PHASE_GROUP_BY_STAGE (all → paper_review
phase). Without this, projects landing in the new states would be
rendered as "blocked" (the fallback group), which is misleading.
- _project_to_entry payload gains revision_spec_path (links to the
auto-planned revision spec dir when stage == READY_FOR_IMPLEMENTATION)
and upstream_feedback (summary of the arxiv-intake annotation).
- _upstream_feedback_summary() reads upstream_feedback.yaml and returns
{schema_version, round_count, latest_verdict_class, latest_action_item_count}.
None when the file is absent (most projects).
Regenerates web/data/projects.json with the new fields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
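A small sketch of the two dashboard behaviors in this commit: the stage-to-phase lookup with its "blocked" fallback, and the summary dict. The returned keys come from the commit message, but the input layout (`rounds`, `verdict_class`, `action_items`) is an assumption about `upstream_feedback.yaml`, and the function takes an already-parsed dict so that no YAML library is needed.

```python
# Only the three stages added by this commit; the real mapping has more entries.
_PHASE_GROUP_BY_STAGE = {
    "PAPER_REVISION_IN_PROGRESS": "paper_review",
    "READY_FOR_IMPLEMENTATION": "paper_review",
    "PAPER_REVISION_BLOCKED": "paper_review",
}


def phase_group(stage: str) -> str:
    """Unmapped stages fall back to the 'blocked' group."""
    return _PHASE_GROUP_BY_STAGE.get(stage, "blocked")


def upstream_feedback_summary(document):
    """Summarize a parsed upstream_feedback document; None when the file is absent.

    Assumed input layout: {"schema_version": ..., "rounds": [{"verdict_class": ...,
    "action_items": [...]}, ...]}.
    """
    if document is None:
        return None
    rounds = document.get("rounds", [])
    latest = rounds[-1] if rounds else {}
    return {
        "schema_version": document.get("schema_version"),
        "round_count": len(rounds),
        "latest_verdict_class": latest.get("verdict_class"),
        "latest_action_item_count": len(latest.get("action_items", [])),
    }
```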
…gate (T050)
Adds the end-to-end convergence test required by SC-001 / T050. The test
covers the three terminal outcomes:
- All specialists accept → PAPER_ACCEPTED
- Writing-class action items → PAPER_REVISION_IN_PROGRESS → 5 artifacts
+ READY_FOR_IMPLEMENTATION
- Fatal-class action items → BRAINSTORMED + rejection rationale appended
to the idea record
Gated on LLMXIVE_REAL_TESTS=1 per the real-call test convention. The test
exercises pure-Python logic + real filesystem state (no Dartmouth calls
needed; the deterministic revision_planner emits artifacts directly).
ALSO fixes a defensive bug in _all_specialists_accept_most_recent:
previously, when `required` was empty (registry-load failure), the gate
trivially returned True — which meant any non-accept review on an
unconfigured registry would be incorrectly routed to PAPER_ACCEPTED.
New behavior:
- empty required + no records → False (unconfigured; refuse to advance)
- empty required + all-accept records → True (every reviewer that
recorded a verdict accepted; vacuously OK)
- empty required + any non-accept → False (severity branch takes over)
- non-empty required + records → standard per-specialist most-recent check
Two unit tests added in test_advancement_convergence.py to lock the new
behavior in (replacing the prior single test_empty_required_gate_passes_trivially).
Full unit suite (463+e2e) passes locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
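The four cases listed in this commit reduce to a short function. This is a sketch of the described behavior rather than the actual `_all_specialists_accept_most_recent`: it assumes the caller has already reduced the review history to a `latest` mapping of specialist name to most-recent non-stale verdict string.

```python
def all_specialists_accept_most_recent(required, latest):
    """Gate on the most recent verdict per specialist.

    `required` is the list of specialists from the registry (possibly empty if
    the registry failed to load); `latest` maps specialist -> verdict string.
    """
    if not required:
        if not latest:
            # Unconfigured registry and no records: refuse to advance.
            return False
        # Unconfigured registry: pass only if every recorded verdict is accept.
        return all(verdict == "accept" for verdict in latest.values())
    # Standard path: every required specialist must have a most-recent accept.
    return all(latest.get(name) == "accept" for name in required)
```

The explicit empty-`required` branch matters because `all()` over an empty sequence is True in Python, which is how the earlier version could pass trivially.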
Updates the "How it works → The paper pipeline" section to describe the spec-012 convergence pipeline (structured action items, most-recent verdict gate, three-way severity routing, per-specialist re-review protocol, and arxiv-intake guardrail). Closes the last remaining task in the spec-012 task list (T053). With this commit, all 55 of 55 tasks are now landed on PR #198.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
280f8b6 to 5cbbdda
Summary
Phase 1 of spec 012 (paper review convergence). Implements ~25 of 55 tasks: the foundational schema work + the most-recent-verdict acceptance gate + severity-based routing + arxiv-intake guardrail. The remaining 30 tasks (auto-plan revision pipeline + re-review protocol consumer logic + integration + polish) ship in follow-up PRs.
What this PR enables
- The four already-passing arxiv-intake papers (PROJ-564 / 565 / 566 / 576) can now reach `PAPER_ACCEPTED` on the next paper-review cron tick; the all-accept gate is what was blocking them.
- Fatal-severity action items route the project to `BRAINSTORMED` with a rejection rationale automatically appended to the idea record. PROJ-578's "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro are unverifiable" finding would land it here (once its reviews are re-emitted under prompt_version 1.1.0).
- Arxiv-intake papers (third-party, frozen source) can never trigger a writing/science revision pipeline against `paper/source/`; instead, the consolidated action items land in `projects/<PROJ-ID>/upstream_feedback.yaml`.
Scope (what's IN this PR, ~25 tasks)
- … `action_items`; `paper_reviewer.py` parses them
- … `BRAINSTORMED + rejection rationale` for fatal
- … `upstream_feedback.yaml`, `is_arxiv_intake`, `append_rejection_rationale`)
Deferred (~30 tasks for follow-up)
- `revision_planner.py`: the 5-stage subprocess driver that auto-runs `speckit-{specify,clarify,plan,tasks,analyze}` for revision specs. This is the biggest unbuilt piece; needs ~500 LOC + real-call tests.
- `paper_reviewer.py` wiring of the shared rereview snippet when prior reviews exist for THIS specialist. The snippet itself (`agents/prompts/_shared/rereview_block.md`) ships here; the consumer is the follow-up.
- … `llmxive project unblock` CLI, full-cycle e2e real-call test
- … `PAPER_REVISION_IN_PROGRESS` / `READY_FOR_IMPLEMENTATION` / `PAPER_REVISION_BLOCKED` badges, README update
The advancement evaluator still routes legacy verdicts (prompt_version 1.0.x records with no action_items) through the pre-spec-012 `_winning_recommendation` path so existing projects don't regress while reviews are gradually re-emitted under 1.1.0.
Test plan
- … `PAPER_ACCEPTED` after specialists are re-prompted under 1.1.0
🤖 Generated with Claude Code